A Generic XML-Based Format for Structured Linguistic Annotation and Its Application to the Prague Dependency Treebank 2.0
Abstract
In the first part of this technical report we describe our approach to designing a new data format, based on XML (Extensible Markup Language), that is intended to provide a better, unifying alternative to the various legacy data formats used in corpus linguistics, and specifically in structured annotation. We introduce the first version of the format, called Prague Markup Language (PML). This version has already been employed as the main data format for the upcoming Prague Dependency Treebank 2.0 (PDT 2.0). Finally, we outline our ideas and proposals for further improvement of PML, based on our current experience with using and processing PML data in the PDT 2.0 project. The second part of the technical report contains the current specification of PML.
Technical report no. TR-2005-29
Technical report of the project Integration of Language Resources for Information Extraction from Natural Texts (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů)
A project of the Information Society programme of the Grant Agency of the Academy of Sciences of the Czech Republic (GA AV ČR)
GA AV ČR registration number: 1ET101120503
MFF internal code: 207-14 / 242083
Similar resources
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
Post-annotation Checking of Prague Dependency Treebank 2.0 Data
This article describes various methods and tools used for the post-annotation checking of Prague Dependency Treebank 2.0 data. The annotation process of the treebank was complicated by several factors: for example, the corpus was divided into several layers that must reflect each other. Moreover, the annotation rules changed and evolved during the annotation. In addition, some part...
Does Netgraph Fit Prague Dependency Treebank?
Using many examples, we present the query language of Netgraph, a fully graphical tool for searching in the Prague Dependency Treebank 2.0. To demonstrate that the query language fits the treebank well, we study an annotation manual for the most complex layer of the treebank, the tectogrammatical layer, and show that linguistic phenomena annotated on the layer can be searched for using the query l...
From Sentence to Discourse: Building an Annotation Scheme for Discourse Based on Prague Dependency Treebank
The present paper reports on preparatory research for building a language corpus annotation scenario capturing the discourse relations in Czech. We primarily focus on the description of the syntactically motivated relations in discourse, basing our findings on the theoretical background of the Prague Dependency Treebank 2.0 and the Penn Discourse Treebank 2. Our aim is to revisit the present-...
Netgraph Query Language for the Prague Dependency Treebank 2.0
We study the annotation of the Prague Dependency Treebank 2.0 (PDT 2.0) and assemble a list of requirements for a query language that would allow searching for and studying all linguistic phenomena annotated in the treebank. We propose an extension to the query language of an existing search tool, Netgraph 1.0, and show that the extended query language satisfies the list of requirements. We demons...
Publication date: 2005